Japanese Effort Toward Sharing Text and Speech Corpora

Authors

  • Shuichi Itahashi
  • Kôiti Hasida
Abstract

This report introduces the activities of two organizations involved in the collection and distribution of text and speech corpora in Japan: the Language Resource Association (GSK) and the NII-Speech Resources Consortium (NII-SRC).

Similar resources

Towards Automatic Transformation between Different Transcription Conventions: Prediction of Intonation Markers from Linguistic and Acoustic Features

Because of the tremendous effort required for recording and transcription, large-scale spoken language corpora have hardly been developed in Japanese, with the notable exception of the Corpus of Spontaneous Japanese (CSJ). Various research groups have individually developed conversation corpora in Japanese, but these corpora are transcribed by different conventions and have few annotations in com...

Japanese Dialogue Corpus of Multi-Level Annotation

This paper describes a Japanese dialogue corpus annotated with multi-level information, built by the Japanese Discourse Research Initiative of the Japanese Society for Artificial Intelligence. The annotation information consists of speech, transcription delimited by slash units, prosody, part of speech, dialogue acts and dialogue segmentation. In the project, we used the corpus for obtaining new find...

Construction of Chinese Segmented and POS-tagged Conversational Corpora and Their Evaluations on Spontaneous Speech Recognitions

The performance of a corpus-based language and speech processing system depends heavily on the quantity and quality of the training corpora. Although several famous Chinese corpora have been developed, most of them are mainly written text. Even for some existing corpora that contain spoken data, the quantity is insufficient and the domain is limited. In this paper, we describe the development o...

The Mega-Word Tagged-Corpus Project

Large corpora with part-of-speech tagging play a very important role in recent statistics-based and example-based natural language processing systems. However, no such corpora have become widely available for Japanese so far. Because the Japanese language has no explicit word boundaries, it is impossible even to count words without a corpus that has at least word segmentations. This paper descr...

Disfluency patterns in dialogue processing

Spontaneous speech abounds with disfluencies such as filled pauses, repairs, repetitions, false starts and prolongations, all of which are significant but easily overlooked features of speech communication. Based on the comparable corpora of English and Japanese dialogues, we argue that disfluency features can have a positive effect on turn-taking issues and the establishment of common referring...

Publication date: 2008